Tests: move generate tests to the right mixin and delete redundant tests #34464

gante merged 17 commits into huggingface:main

Conversation
# Exception 1: when passing input_embeds, input_ids may be missing entries
# Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here
- # Exception 3: with synced GPUs cache_position may go out of bounds, but we only want dummy token in that case
+ # Exception 3: with synced GPUs cache_position may go out of bounds, but we only want dummy token in that case.
(changes in this file were done for end-to-end compilation)
def _check_similar_generate_outputs(self, output_1, output_2, atol=1e-5, rtol=1e-5):
    """
    Checks whether a pair of generate outputs are similar. Two `generate` call outputs are considered similar in
    the following situations:
    1. The sequences are the same
    2. The sequences are different, but the scores up until the first mismatch are nearly identical
    """
Many generate tests wanted to check equivalence of something, and were incorrectly tagged as flaky. This function contains the correct equivalence check -- as a result, several @is_flaky were removed :D
(I honestly don't know why I haven't implemented this before, instead of adding @is_flaky 🙃 )
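For reference, the check described above boils down to something like the sketch below. It is illustrative, not the literal mixin code, and assumes both outputs were produced with `return_dict_in_generate=True` and `output_scores=True` so they expose `.sequences` and `.scores`.

```python
import torch


def check_similar_generate_outputs(output_1, output_2, atol=1e-5, rtol=1e-5):
    # Case 1: identical sequences -> similar by definition.
    if torch.equal(output_1.sequences, output_2.sequences):
        return

    # Case 2: sequences differ, but the per-step scores up to (and including) the first
    # mismatching step must be near-identical -- tiny FP differences can flip an argmax
    # and change every subsequent token.
    num_new_tokens = min(len(output_1.scores), len(output_2.scores))
    prompt_len_1 = output_1.sequences.shape[1] - len(output_1.scores)
    prompt_len_2 = output_2.sequences.shape[1] - len(output_2.scores)
    for step in range(num_new_tokens):
        torch.testing.assert_close(output_1.scores[step], output_2.scores[step], atol=atol, rtol=rtol)
        token_1 = output_1.sequences[:, prompt_len_1 + step]
        token_2 = output_2.sequences[:, prompt_len_2 + step]
        if not torch.equal(token_1, token_2):
            break  # first mismatch reached -- everything up to here was already checked
```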
self.assertIsInstance(output_generate, GreedySearchDecoderOnlyOutput)

- self._check_outputs(output_generate, main_input, model.config)
+ self._check_outputs(output_generate, model.config)
We actually don't need main_input -- we can obtain the batch size directly from the outputs of generate
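For example (a sketch; assumes `return_dict_in_generate=True`, so the output exposes `.sequences`):

```python
# The batch size can be read off the generate output itself, so the check helper
# no longer needs the original `main_input` tensor:
batch_size = output_generate.sequences.shape[0]
```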
tests/generation/test_utils.py
def test_generate_from_inputs_embeds(self):
    """Tests that we can generate from `inputs_embeds` instead of `input_ids` in LLMs, VLMs, etc"""
Related to the deleted inputs_embeds tests -- clarifies that this test also runs for VLMs
Tests that generating with static cache gives almost the same results as with dynamic cache, and the output cache
has the expected shapes
Merges this existing test, which checked the static cache returned by generate, with the deleted static cache equivalence test.
One generate call is enough to run both checks :)
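Roughly, the merged test does something like the sketch below (an illustrative helper: it assumes the model supports `cache_implementation="static"` and that the returned StaticCache exposes per-layer key tensors of shape `(batch, num_heads, max_cache_len, head_dim)`).

```python
import torch


def check_static_cache_generate(model, inputs, max_new_tokens=10):
    common_kwargs = {
        "max_new_tokens": max_new_tokens,
        "do_sample": False,
        "return_dict_in_generate": True,
        "output_scores": True,
    }
    dynamic_out = model.generate(**inputs, **common_kwargs)
    static_out = model.generate(**inputs, **common_kwargs, cache_implementation="static")

    # Check 1 (from the deleted equivalence test): static cache results match dynamic cache results.
    # The real test would rely on the similarity helper described above rather than strict equality.
    torch.testing.assert_close(static_out.sequences, dynamic_out.sequences)

    # Check 2 (from the pre-existing test): the returned static cache is pre-allocated up to
    # `prompt length + max_new_tokens`.
    max_cache_len = inputs["input_ids"].shape[1] + max_new_tokens
    for layer_keys in static_out.past_key_values.key_cache:
        assert layer_keys.shape[2] == max_cache_len
```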
tests/generation/test_utils.py
| ("end_to_end", True), # TODO (@joao): end-to-end compilation is broken with torch 2.5+, explore and fix | ||
| ] | ||
| ) | ||
| def test_generate_compile(self, name, end_to_end): |
Instead of having a test for torch.compile against large checkpoints, let's do it against dummy checkpoints -- the importance of the test is to confirm a) we can compile AND b) compiled is equivalent to non-compiled.
Testing model.forward compilation is very similar to testing model.generate compilation, so parameterized is used. A few modifications were added to make the test more efficient (e.g. compile only once)
Note: this test doesn't have to be slow, but we currently have a few failures. Let's fix them in a follow-up PR
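The structure is roughly as follows. This is a condensed sketch: `_get_model_and_inputs` is a hypothetical helper, and the real test has extra setup (e.g. static cache configuration for the compiled runs).

```python
import unittest

import torch
from parameterized import parameterized


class GenerateCompileSketch(unittest.TestCase):
    @parameterized.expand(
        [
            ("forward_only", False),  # compile model.forward, keep the generation loop eager
            ("end_to_end", True),     # compile the whole generate call
        ]
    )
    def test_generate_compile(self, _, end_to_end):
        model, inputs = self._get_model_and_inputs()  # hypothetical helper
        generation_kwargs = {"max_new_tokens": 10, "do_sample": False}

        # a) reference run without compilation
        eager_sequences = model.generate(**inputs, **generation_kwargs)

        # b) compiled run -- compile once, then reuse the compiled callable
        if end_to_end:
            compiled_generate = torch.compile(model.generate)
            compiled_sequences = compiled_generate(**inputs, **generation_kwargs)
        else:
            model.forward = torch.compile(model.forward)
            compiled_sequences = model.generate(**inputs, **generation_kwargs)

        # the test's core claims: it compiles, and compiling does not change the generated tokens
        torch.testing.assert_close(compiled_sequences, eager_sequences)
```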
@require_torch_sdpa
@slow
def test_eager_matches_sdpa_generate(self):
The test to check that SDPA = eager is the same as the test to check that FA2 = eager. However, parameterized doesn't work here: we want different pytest.mark decorators for each test. As such, the original test was moved to a helper function.
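The pattern looks roughly like this (a sketch: the helper name, checkpoint, and body are illustrative; only the two test names and the decorator sets follow the comment above).

```python
import torch
from pytest import mark

from transformers import AutoModelForCausalLM
from transformers.testing_utils import require_flash_attn, require_torch_gpu, require_torch_sdpa, slow


class AttnEquivalenceSketch:
    def _attn_implementation_matches_eager_generate(self, attn_implementation):
        checkpoint = "hf-internal-testing/tiny-random-LlamaForCausalLM"  # any small causal LM works here
        inputs = {"input_ids": torch.tensor([[1, 2, 3, 4]], device="cuda")}

        model_eager = AutoModelForCausalLM.from_pretrained(
            checkpoint, attn_implementation="eager", torch_dtype=torch.bfloat16
        ).to("cuda")
        model_other = AutoModelForCausalLM.from_pretrained(
            checkpoint, attn_implementation=attn_implementation, torch_dtype=torch.bfloat16
        ).to("cuda")

        out_eager = model_eager.generate(**inputs, max_new_tokens=5, do_sample=False)
        out_other = model_other.generate(**inputs, max_new_tokens=5, do_sample=False)
        torch.testing.assert_close(out_eager, out_other)  # the real test uses the similarity helper

    # Same body, different markers -- which is why `parameterized` can't be used here.
    @require_torch_sdpa
    @slow
    def test_eager_matches_sdpa_generate(self):
        self._attn_implementation_matches_eager_generate("sdpa")

    @require_flash_attn
    @require_torch_gpu
    @mark.flash_attn_test
    @slow
    def test_eager_matches_fa2_generate(self):
        self._attn_implementation_matches_eager_generate("flash_attention_2")
```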
)[0]
self.assertTrue(torch.allclose(out_embeds, out_ids))

def test_inputs_embeds_matches_input_ids_with_generate(self):
We already had a test that checks that generating from inputs_embeds is equivalent to generating from input_ids -- test_generate_from_inputs_embeds
This test was deleted.
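For context, the equivalence that test_generate_from_inputs_embeds covers is roughly the following (a sketch, decoder-only case):

```python
import torch


def check_generate_from_inputs_embeds(model, input_ids):
    inputs_embeds = model.get_input_embeddings()(input_ids)

    from_ids = model.generate(input_ids, max_new_tokens=5, do_sample=False)
    from_embeds = model.generate(inputs_embeds=inputs_embeds, max_new_tokens=5, do_sample=False)

    # When generating from embeddings there are no prompt ids to prepend, so the output only
    # contains the new tokens -- compare it against the continuation of the input_ids run.
    torch.testing.assert_close(from_ids[:, input_ids.shape[1]:], from_embeds)
```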
one q: afaik the other test is skipped for encoder-decoder models? Can we enable it just to be sure we cover all models with generation abilities?
Good point, thank you for catching that! I'll update test_generate_from_inputs_embeds so as to support VLMs
✅ test_generate_from_inputs_embeds updated for VLMs, overwrites deleted. @zucchini-nlp have a look at the updated test -- you might need to add VLMs to the list of added special cases in the test
@mark.flash_attn_test
@slow
@is_flaky()
def test_flash_attn_2_generate_left_padding(self):
We have:
- a test that checks that we can generate with left-padding (test_left_padding_compatibility)
- a test that checks that we can generate with FA2 (test_eager_matches_fa2_generate, added in this PR)
- a test that checks that we can run the forward pass in FA2 with left-padding (test_flash_attn_2_inference_equivalence)

As such, this test was redundant. From generate's point of view, FA2 is a modification to model.forward, so if model.forward is equivalent then so is model.generate.
Deleted.
@mark.flash_attn_test
@is_flaky()
@slow
def test_flash_attn_2_generate_padding_right(self):
Similar to the above (we have individual checks in place). Moreover, generating with right-padding doesn't make sense for most models :D
Deleted.
@require_torch_gpu
@mark.flash_attn_test
@slow
def test_flash_attn_2_generate_use_cache(self):
This test was pretty toothless -- it didn't even check the actual output of generate.
I've added test_eager_matches_fa2_generate in its place, which checks the outputs. It's a clone of the SDPA generate equivalence test, except that it uses FA2.
@require_torch_gpu
@mark.flash_attn_test
@slow
def test_flash_attn_2_generate_reuse_cache(self):
Similar to the FA2 + padding tests above.
In this case, we check that we can reuse a cache in test_generate_continue_from_past_key_values, no need for a FA2-specific test.
Deleted.
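For reference, the cache-reuse pattern exercised by test_generate_continue_from_past_key_values looks roughly like this (a sketch, not the actual test body):

```python
import torch


def check_cache_reuse(model, inputs):
    # 1) generate a first chunk and keep the returned cache
    first = model.generate(**inputs, max_new_tokens=3, do_sample=False, return_dict_in_generate=True)

    # 2) continue generation, feeding back the sequence so far together with the cached key/values
    continued = model.generate(
        first.sequences, past_key_values=first.past_key_values, max_new_tokens=3, do_sample=False
    )

    # 3) the continued run must match one uninterrupted run of the same total length
    full = model.generate(**inputs, max_new_tokens=6, do_sample=False)
    torch.testing.assert_close(continued, full)
```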
normalized_1 = F.softmax(out_shared_prefix_last_tokens)
torch.testing.assert_close(normalized_0, normalized_1, rtol=1e-3, atol=1e-4)

def test_static_cache_matches_dynamic(self):
Merged the checks in this test [check that the outputs are the same] with the checks in test_generate_with_static_cache [check that the shapes are as expected]
@slow
@require_torch_accelerator
@require_read_token
def test_torch_compile(self):
This test was running against very large checkpoints, and we had to manually specify one in _torch_compile_test_ckpt 👀
We were indeed missing a mixin test for generate with model.forward compiled -- see the new test_generate_compile. The most important part of this test is to confirm that compiled == not compiled; we have checks for the other parts of the problem elsewhere (including checks that large checkpoints work as expected).
This heavy test was, in practice, deleted.
@slow
@require_torch_gpu  # Testing cuda graphs.
@require_read_token
def test_compile_cuda_graph_time(self):
Same as above (uses heavy checkpoints).
We now have a benchmarks page, which is a much better way to track performance than a test.
Great clean up, good to see redundant code killed! Left a few questions and also I think we had test_embeds_match_ids_generate overwritten in all VLMs so prob those can be removed now
Overall LGTM, thanks
EDIT: forgot to mention that a few models might not be running the new tests because they don't have GenerationTesterMixin. Working on adding that on a few VLMs currently and noticed that tests from here might fail for them. Guess it's okay, i'll skip/fix the failing ones after enabling generation tests
# Generate with one batch only to test generation when attention mask will be None
# when real inputs are used, because there is no padding. See issue #32237 for more
dummy_input = dummy_input[:1, ...]
dummy_attention_mask = torch.ones_like(dummy_attention_mask[:1, ...])
_ = model.generate(
    dummy_input,
    attention_mask=dummy_attention_mask,
    max_new_tokens=max_new_tokens,
    do_sample=False,
    use_cache=True,
)
This was testing that FA2 with packing works in generate, and unfortunately we don't have an equivalent forward test. We could extend the test_fa2_position_ids test with this specific case.
Makes sense, I'll make sure there is one packing FA2 test (testing the forward pass, and not generate -- generate does not support packing)
@zucchini-nlp actually test_flash_attn_2_fp32_ln covers the case without attention mask, no need to add more test cases
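For context, "packing" here means concatenating several sequences into a single row with no padding, with the boundaries signalled via position_ids that restart at 0. An illustrative example (the values are made up):

```python
import torch

# Two sequences, [101, 102, 103] and [201, 202], packed into one row with no padding.
packed_input_ids = torch.tensor([[101, 102, 103, 201, 202]])
# position_ids restart at 0 at each sequence boundary; the FA2 integration can use these
# restarts to recover per-sequence lengths and run variable-length attention, so tokens
# don't attend across sequence boundaries.
packed_position_ids = torch.tensor([[0, 1, 2, 0, 1]])

# A forward-pass check (not generate) would compare
#   model(packed_input_ids, position_ids=packed_position_ids)
# against running the two sequences separately.
```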
@slow
@is_flaky()  # compilation may result in equivalent (!= same) FP ops, causing the argmax in `generate` to be flaky
- def test_generate_compile_fullgraph(self):
+ def test_generate_compile(self, _, end_to_end):
RUN_SLOW=1 python3 -m pytest -v tests/models/llama/test_modeling_llama.py -k "test_generate_compile"
gives me
tests/models/llama/test_modeling_llama.py::LlamaModelTest::test_generate_compile_1_end_to_end - RuntimeError: Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run. Stack trace: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/external_utils.py", line 38, ...
@ydshieh have a look at the TODO in the @parameterized of this test -- both parameterizations have broken cases, whose cause comes from outside this PR.
This PR is already quite big, I'd rather fix the root causes of both parameterizations in a follow-up PR 🤗 This PR moves the tests to the right place, and doesn't touch modeling code
ArthurZucker left a comment
Long overdue! Thanks everyone for reviewing and @gante for the cleanup!
…tests (huggingface#34464)

* tmp commit
* tmp commit
* cull overwrites of deleted tests
* typo
* more specific docstring
* make fixup
* parameterize at the top?
* correction
* more deletions :D
* tmp commit
* for VLMs too
* fix _check_outputs
* test nit
* make fixup
* fix another flaky
* test_generate_from_inputs_embeds -- handle missing attention mask
Moves generate tests incorrectly placed in the general mixin to GenerationTesterMixin. In the process, removes redundant tests and streamlines repeated logic.

👀 test_modeling_utils.py, the last file in the diff: in it, I explain what happened to each test and why -- most were actually deleted, as they were redundant.

After this PR:
✅ fewer redundant tests and less code duplication
✅ fewer flaky tests
✅ faster tests
✅ more test coverage

In a follow-up PR:
👉 Fix failing upgraded generate+FA2 test
👉 Fix failing added generate+torch.compile test

Closes #32913